Sample Efficient Actor-Critic with Experience Replay
Authors
Abstract
This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.
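To make the first of those innovations concrete, the weights it produces for a discrete action space might look like the following minimal sketch (the function name, the default threshold c, and the array shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

def acer_weights(pi, mu, a_t, c=10.0):
    """Truncated importance weight for the sampled action, plus
    bias-correction weights over all actions.

    pi  : target-policy probabilities, shape [n_actions]
    mu  : behaviour-policy probabilities, shape [n_actions]
    a_t : index of the action actually taken under mu
    c   : truncation threshold (illustrative default)
    """
    rho = pi / mu                    # per-action importance ratios
    rho_t = min(rho[a_t], c)        # truncated weight: bounds the variance
    # The correction term is nonzero only where rho exceeds c, and is
    # weighted by pi so it can be evaluated without further sampling.
    correction = np.maximum(0.0, (rho - c) / rho) * pi
    return rho_t, correction
```

In the full gradient, the truncated weight scales the score-function term for the sampled action, while the correction weights scale an expectation over all actions under the target policy, keeping the estimator unbiased with bounded variance.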
Similar papers
The Reactor: A Sample-Efficient Actor-Critic Architecture
In this work we present a new reinforcement learning agent, called Reactor (for Retrace-Actor), based on an off-policy multi-step return actor-critic architecture. The agent uses a deep recurrent neural network for function approximation. The network outputs a target policy π (the actor), an action-value Q-function (the critic) evaluating the current policy π, and an estimated behavioural policy...
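A three-head network matching that description could be sketched as follows (the layer sizes, names, and the specific LSTM cell are assumptions for illustration, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class ReactorStyleNet(nn.Module):
    """Recurrent trunk with three heads: target policy, Q-values,
    and an estimate of the behaviour policy."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.pi_head = nn.Linear(hidden, n_actions)  # actor: target policy logits
        self.q_head = nn.Linear(hidden, n_actions)   # critic: Q(s, a) per action
        self.mu_head = nn.Linear(hidden, n_actions)  # estimated behaviour policy

    def forward(self, obs_seq, state=None):
        h, state = self.lstm(obs_seq, state)
        return self.pi_head(h), self.q_head(h), self.mu_head(h), state
```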
Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management
Deep reinforcement learning (RL) methods have significant potential for dialogue policy optimisation. However, they suffer from poor performance in the early stages of learning. This is especially problematic for on-line learning with real users. Two approaches are introduced to tackle this problem. Firstly, to speed up the learning process, two sample-efficient neural network algorithms: tru...
Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control
Policy gradient methods for direct policy optimization are widely used to obtain optimal policies in continuous Markov decision process (MDP) environments. However, policy gradient methods require exponentially many samples as the dimension of the action space increases. Thus, off-policy learning with experience replay has been proposed to enable the agent to learn by using samples of other pol...
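The mechanism this abstract builds on, a replay buffer from which past transitions are re-sampled, can be sketched minimally as follows (the ReplayBuffer class and its defaults are illustrative, not taken from the cited paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample them i.i.d."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive steps and lets each
        # transition be reused many times, which is the source of the
        # sample-efficiency gain the abstract refers to.
        batch = random.sample(self.storage, batch_size)
        return tuple(zip(*batch))
```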
Data-Based Reinforcement Learning Algorithm with Experience Replay for Solving Constrained Nonzero-Sum Differential Games
In this paper, a partially model-free reinforcement learning (RL) algorithm based on experience replay is developed for finding, online, the Nash equilibrium solution of multi-player nonzero-sum (NZS) differential games. In order to avoid performance degradation or even system instability, an amplitude limitation on the control inputs is considered in the design procedure. The proposed al...
Combining policy gradient and Q-learning
Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and unable to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between th...
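The connection alluded to at the end of this abstract is between entropy-regularised policy gradient and Q-values; the simplest form of that link is the softmax policy induced by Q-values, sketched below (the function name and the temperature alpha are illustrative assumptions):

```python
import numpy as np

def boltzmann_policy(q, alpha=0.1):
    """Softmax policy induced by Q-values at temperature alpha.
    Entropy-regularised policy gradient pushes the policy towards
    this form, which ties policy probabilities to Q-values."""
    z = (q - q.max()) / alpha   # shift by the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```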
Journal: CoRR
Volume: abs/1611.01224
Pages: -
Year: 2016